Enable split mode graph for on-the-fly merged up/gate experts#1413
Conversation
Commits in #1413:
- Split mode graph for on-the-fly merged ffn_up/gate_exps
- Cleanup
- Also handle merged bias
Are on-the-fly merged ffn_up/gate_exps ggufs faster than just using --merge-up-gate-experts on non-merged ggufs?
It is the same thing. I started using "on-the-fly merged" for …
I'm confused. Does that mean using pre-merged ggufs is the same as using non-merged ggufs with the option …?
Yes, it is the same. In the pre-merged case, someone (for instance AesSedai) has prepared the model such that the …
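For intuition, here is a minimal sketch (in NumPy, with made-up tiny shapes, not the actual ik_llama.cpp code) of what merging the up and gate expert tensors amounts to: the two weight matrices are concatenated once, so a single matmul produces both projections. Whether that concatenation happens offline (pre-merged gguf) or at load time (`-muge`), the resulting tensor and the FFN output are the same.

```python
import numpy as np

# Hypothetical shapes for illustration only: hidden size 4, FFN size 6,
# one expert. Real models use much larger tensors, one per expert.
rng = np.random.default_rng(0)
x = rng.standard_normal(4)
w_gate = rng.standard_normal((6, 4))  # ffn_gate_exps weight (one expert)
w_up = rng.standard_normal((6, 4))    # ffn_up_exps weight (one expert)

def silu(v):
    return v / (1.0 + np.exp(-v))

# Separate tensors: two matmuls per expert.
y_separate = silu(w_gate @ x) * (w_up @ x)

# Merged tensor: concatenate the rows once (offline or at load time),
# then do a single matmul and split the result.
w_merged = np.concatenate([w_gate, w_up], axis=0)
h = w_merged @ x
y_merged = silu(h[:6]) * h[6:]

assert np.allclose(y_separate, y_merged)
```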
```
# non-merged gguf, without -muge
CUDA_VISIBLE_DEVICES=6,7 numactl --cpunodebind=1 --membind=1 ./ik_llama.cpp-b4283/llama-sweep-bench -rtr -b 2048 -ub 2048 --split-mode graph --cache-type-k q8_0 --cache-type-v q8_0 --n-gpu-layers 999 --model Qwen3.5-35B-A3B-Q5_K_M-00001-of-00002.gguf --mmproj mmproj-Qwen3.5-35B-A3B-BF16.gguf --ctx-size 225280 --threads 8 --tensor-split 1,1 --numa isolate

# non-merged gguf, with -muge
CUDA_VISIBLE_DEVICES=6,7 numactl --cpunodebind=1 --membind=1 ./ik_llama.cpp-b4283/llama-sweep-bench -rtr -b 2048 -ub 2048 --split-mode graph --cache-type-k q8_0 --cache-type-v q8_0 --n-gpu-layers 999 --model Qwen3.5-35B-A3B-Q5_K_M-00001-of-00002.gguf --mmproj mmproj-Qwen3.5-35B-A3B-BF16.gguf --ctx-size 225280 --threads 8 --tensor-split 1,1 --numa isolate -muge

# pre-merged gguf, without -muge
CUDA_VISIBLE_DEVICES=6,7 numactl --cpunodebind=1 --membind=1 ./ik_llama.cpp-b4283/llama-sweep-bench -rtr -b 2048 -ub 2048 --split-mode graph --cache-type-k q8_0 --cache-type-v q8_0 --n-gpu-layers 999 --model Qwen3.5-35B-A3B-Q5_K_M-merge-gate-up-00001-of-00002.gguf --mmproj mmproj-Qwen3.5-35B-A3B-BF16.gguf --ctx-size 225280 --threads 8 --tensor-split 1,1 --numa isolate

# pre-merged gguf, with -muge
CUDA_VISIBLE_DEVICES=6,7 numactl --cpunodebind=1 --membind=1 ./ik_llama.cpp-b4283/llama-sweep-bench -rtr -b 2048 -ub 2048 --split-mode graph --cache-type-k q8_0 --cache-type-v q8_0 --n-gpu-layers 999 --model Qwen3.5-35B-A3B-Q5_K_M-merge-gate-up-00001-of-00002.gguf --mmproj mmproj-Qwen3.5-35B-A3B-BF16.gguf --ctx-size 225280 --threads 8 --tensor-split 1,1 --numa isolate -muge
```
In my experience, running … If I don't use … I suppose it is due to the "unexpected results if using custom tensor offloads with split-mode graph" warning. I am thankful that it still works very well without merging the up and gate expert tensors, so this is just a drawback of using custom tensor offloading.
I'm not 100% sure, but digging through some recent PRs on imatrix fused up|gate tensors and the original … Same thing if you're using a mainline pre-merged quant... ik was gracious and re-named the existing convention here to reduce confusion with the new, opposite naming convention on mainline... so it is …
@ubergarm, thanks for the tips. This will help when using models with these experts merged (I will be using a catch-all regex …). What I am seeing when offloading the up and gate tensors with a non-merged gguf (the one in your repo) is that they are merged after they are loaded to the device. If I use … So that is not the issue. The output still repeats itself, so using …
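For reference, a custom tensor offload of the kind discussed above is usually expressed with `-ot`/`--override-tensor`, which takes `regex=backend` pairs. The regex below is only an illustration, not the catch-all one mentioned in the comment:

```shell
# Hypothetical example (illustration only): keep all routed-expert tensors
# on CPU while the remaining layers go to the GPUs.
./llama-sweep-bench \
  --model model.gguf \
  --n-gpu-layers 999 \
  -sm graph -muge \
  -ot "ffn_(up|gate|down)_exps=CPU"
```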

This PR is a follow-up of #1412. It enables usage of on-the-fly merged `ffn_up/gate_exps` tensors (`-muge` command line option) with split mode `graph`. On a 2x3090 system, I see ~10% better PP for the few models I tested.

As a reminder: add `-sm graph -muge` to the command line to get the benefit of this PR.

Here is a sweep-bench for GPT-OSS-20B-MXFP4 on the 2x3090 system. The `llama.cpp` results are with build 8314.